1. INTRODUCTION

Optimization-based clustering algorithms are increasingly used to cluster complex problems across
different environments and situations. The algorithm considered here clusters data by nearest neighbor in
flow depth, modeled on the natural behavior and characteristic flow of rainwater. This hybrid
optimization-based clustering algorithm also uses mathematical calculations to update the location and
velocity of the next nearest neighbor as it moves from one position to another. The land space location is
used to locate all the drops of rainfall, which are considered as land space data (LSD). All drops considered
together form a flood, that is, the total data in the located land space (the dataset). A drop can move from its
current position to another position depending on the condition of the location, such as a river, drain, pond,
lake, or other storage location.

Optimization-based algorithms such as sunflower optimization (SFO) [1], the rider optimization
algorithm (ROA) [2], gray wolf optimization (GWO) [3], particle swarm optimization (PSO) [4], and the
genetic algorithm [5] are powerful algorithms in machine learning under data mining, which is a sub-branch
of artificial intelligence [6], [7]. There are three main models of machine learning: unsupervised learning,
supervised learning, and reinforcement learning. Depending on the input and other features, learning is
defined as with a supervisor, without a supervisor, or driven by outcomes such as failure or success [8].
A further form of machine learning, called semi-supervised learning, combines learning with a supervisor
(classification) and without a supervisor (clustering) [9]-[11].

The main motivation for designing the proposed data clustering algorithm is to produce better
optimum clustering solutions with faster convergence. To this end, this paper proposes a fuzzy C-means
(FCM) clustering algorithm based on rainfall flow optimization (RFFO) for medical data. The problem is
handled in three steps: i) data preprocessing, ii) feature selection, and iii) data clustering by the proposed
data clustering algorithm.

The proposed solution is built on two algorithmic concepts. First, FCM clustering is used to produce
an optimum clustering solution. Second, RFFO is used to find the optimum clustering centroids. The FCM
clustering algorithm [12], [13] is one of the most popular clustering algorithms based on a mathematical
logical model. The RFFO approach [14] is based on natural rainfall flow behavior. The open land source
(OLS) is suitable for locating raindrops, which are referred to as open land source data (OLSD) [15], [16].
The raindrops are collected together and regarded as a torrent, that is, the total fed data [17]. In some
situations and places the water does not move anywhere else, which is called data stagnation. When it rains,
the collected raindrops (water) flow from one position to another and gather as a flood, which corresponds to
the clustering of data. A storage space pulls raindrops toward it according to the least distance, and the slope
determines their acceleration. Water storage depends on the minimum distance to the storage location, its
maximum depth, its maximum storage size, and the condition of the storage location, such as soil, climate,
the nature of wetlands, and artificial lakes. Real-world applications today come in various types, with
heterogeneous data sets and dissimilar features [18]. To solve these complex problems, this paper presents a
novel optimization-based clustering algorithm, the FCM based RFFO algorithm. The main challenge of
medical data clustering is data preprocessing, which must handle missing data, noisy data, data inconsistency,
and redundant data in data mining [19]. Medical data clustering today is vast and intricate because of the
large size of the incoming data, hidden information, massive volume, and high frequency. A visual
representation of the proposed FCM clustering algorithm based on the RFFO technique for medical data is
shown in Figure 1.
Figure 1. Visual representation of the FCM based on RFFO algorithm


2. LITERATURE REVIEW 

This literature review covers five recent medical data clustering methods together with their
drawbacks. In 2018, Yelipe et al. designed an imputation method based on class-based clustering, called
IM-CBC, to identify and evaluate the similarity between two medical records. Their work used Euclidean
distance with fuzzy similarity functions to find the similarity between clusters. Classification was then
carried out with methods such as SVM, C4.5, or k-nearest neighbors (KNN), and high accuracy was reported.
However, the method did not examine fuzzy measures for classifying data and predicting results from the
given medical data [20]. Also in 2018, Das et al. proposed a modified bee colony optimization (MBCO)
technique that clusters medical data by combining the K-means clustering algorithm with chaotic theory for
faster convergence. Compared with other clustering methods, MBCO was shown to converge faster, but this
hybrid method does not adapt to multi-objective functions and was not applied to high-frequency data
streams [21]. In 2019, Chauhan et al. presented a two-step clustering technique that analyzes patients'
disorders using different variables and determines the early stage of liver disease from hidden
knowledge [22]. In 2020, Yu et al. [23] designed medical data clustering and feature extraction using an
immune evolutionary algorithm under cloud computing for big data. Their final results showed better data
classification accuracy and improved performance for medical data. The final results produced a better
clustering solution through optimum clustering centroids, and the algorithm was also compared with the
existing algorithm [14].


3. THE PROPOSED CLUSTERING METHOD 

To combine the advantages of traditional clustering and of global optimization-based techniques for
finding optimum centroids, this paper proposes a hybrid FCM based on RFFO technique for medical data.
RFFO produces the optimum clustering centroids, and from these optimized centroids the FCM clustering
algorithm produces a better clustering solution.


3.1. Introduction about clustering 

In clustering, a collection of data items is grouped into a set of disjoint classes. Clustering is a
sub-branch of unsupervised learning in machine learning [24], [25]. Clustering algorithms take different
forms, from traditional clustering algorithms to global optimization-based clustering techniques. This paper
uses optimization-based clustering through the hybrid FCM based RFFO clustering algorithm.


3.2. Fuzzy C-means clustering 

The FCM clustering algorithm is a simple clustering method under fuzzy logic, in which membership
values lie between 0 and 1. It is a partitioning clustering method based on a mathematical logical model [26].
The core objective of the FCM algorithm is to minimize the cost function $O_{FCM}$, which uses Euclidean
distance and is given by (1).


$$O_{FCM} = O(W, V) = \sum_{i=1}^{X} \sum_{j=1}^{C} (W_{ij})^{q} \, \| B_i - C_j \|^{2} \tag{1}$$


Here $q$ denotes the fuzzification degree (the exponent on the membership), $i = 1, \ldots, X$ and
$j = 1, \ldots, C$ index the membership matrix $W$, $B_i$ is the $i$th item of the given data, and $C_j$ is the
$j$th cluster center. The cluster centers are updated by (2), and the updated fuzzy membership matrix is then
calculated by (3).
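For reference, the textbook FCM updates can be written in the notation of (1), with $q > 1$ standing for the fuzzification degree. This is a sketch under the assumption that (2) and (3) follow the standard form; the symbol $q$ and the summation index $l$ are introduced here for illustration only.

$$C_j = \frac{\sum_{i=1}^{X} (W_{ij})^{q} \, B_i}{\sum_{i=1}^{X} (W_{ij})^{q}}, \qquad W_{ij} = \frac{1}{\sum_{l=1}^{C} \left( \dfrac{\| B_i - C_j \|}{\| B_i - C_l \|} \right)^{2/(q-1)}}$$

As a small worked case, with $q = 2$ and a data item at distances 1 and 3 from two centers, the memberships would be $1/(1 + 1/9) = 0.9$ and $1/(9 + 1) = 0.1$, so they sum to 1 and favor the nearer center.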
3.3. Pseudocode for FCM clustering

Consider the set of data points $B = \{B_1, B_2, B_3, \ldots, B_n\}$ and the set of cluster centers
$C = \{C_1, C_2, \ldots, C_c\}$. Step 1: select the center of each cluster randomly. Step 2: compute the fuzzy
membership values $W_{ij}$ by (3). Step 3: calculate the fuzzy cluster centers $C_j$ by (2). Step 4: repeat
steps 2 and 3 until the defined number of iterations is reached, the change in membership is less than the
given threshold value in (4), or there is no further improvement.


$$\left\| W^{(k+1)} - W^{(k)} \right\| < \varepsilon \tag{4}$$


Here $k$ is the iteration step and $\varepsilon$ is the termination threshold. The FCM iteration stops when the
change in the partition matrix is less than $\varepsilon$, which is set to 0.0001. FCM is somewhat more
beneficial than K-means, but it still has shortcomings compared with global optimization-based clustering: it
is sensitive to the initialization of the cluster centroids and prone to premature convergence.
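The FCM loop described in this section can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: it assumes the textbook membership and center updates for (3) and (2), measures the change in (4) as the largest absolute membership difference, and uses hypothetical names (`fcm`, `euclid`, `init_centers`, `q`) that do not appear in the paper.

```python
import math
import random


def euclid(p, c):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, c)))


def fcm(data, n_clusters, q=2.0, eps=1e-4, max_iter=100, init_centers=None, seed=0):
    """Fuzzy C-means on a list of numeric tuples.

    Returns (centers, w), where w[i][j] is the membership of point i in cluster j.
    """
    rng = random.Random(seed)
    # Step 1: random initial centers, unless the caller supplies them.
    if init_centers is None:
        centers = [list(p) for p in rng.sample(data, n_clusters)]
    else:
        centers = [list(c) for c in init_centers]
    n_dims = len(data[0])
    prev = None
    for _ in range(max_iter):
        # Step 2: membership update (textbook form assumed for (3)).
        w = []
        for p in data:
            d = [max(euclid(p, c), 1e-12) for c in centers]  # floor avoids division by zero
            w.append([1.0 / sum((dj / dl) ** (2.0 / (q - 1.0)) for dl in d) for dj in d])
        # Step 3: center update (textbook form assumed for (2)).
        for j in range(n_clusters):
            weights = [row[j] ** q for row in w]
            total = sum(weights)
            centers[j] = [sum(wt * p[k] for wt, p in zip(weights, data)) / total
                          for k in range(n_dims)]
        # Step 4: stop once the largest membership change drops below eps, as in (4).
        if prev is not None:
            change = max(abs(a - b) for ra, rb in zip(w, prev) for a, b in zip(ra, rb))
            if change < eps:
                break
        prev = w
    return centers, w
```

The optional `init_centers` argument is where externally optimized centroids could be passed in, which is the point at which the sensitivity to initialization noted above would be addressed.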


3.4. Procedure for proposed hybrid FCM based RFFO 
FCM depends on the initial membership matrix values. The candidate data are selected by random
initialization based on a probability distribution. The algorithmic steps of the hybrid FCM based RFFO are as
follows:
— Step 1: Initialization. Initialize $F_{max}$, the maximum number of iterations ($It_{max}$), the acceleration
coefficients $AC_1$ and $AC_2$, the flood best ($F_b = \infty$), and the depth best ($D_b = \infty$). The initial
cluster centroids are then selected randomly.
— Step 2: Evaluate the fitness function. In every iteration, calculate the fitness function $F_{fit}$ by
using (6).


$$F_{fit} = O_{FCM} / K \tag{6}$$


Here $O_{FCM}$ is the objective function calculated by (1), and $K$ is the total number of clusters for the
FCM algorithm; the cost is minimized using the Euclidean distance as the distance measure. The centroid of
each cluster is then recalculated by using (2), and RFFO uses the fitness in (6) to find the optimum centroid
of each cluster.
— Step 3: Velocity and position update. The position and velocity of RFFO are updated by (7) and (8). The
updated position is calculated by


$$P_i(t+1) = P_i(t) + Y_i(t+1) \tag{7}$$
The updated velocity is calculated by
$$Y_i(t+1) = Y_i(t) + C_1 A_1 \left( D_b - P_i(t) \right) + C_2 A_2 \left( F_b - P_i(t) \right) \tag{8}$$
Here $Y_i(t)$ is computed by using (9):
$$Y_i(t) = \frac{K_{ci} \, \Delta x}{W_P} \tag{9}$$
The gradient $\Delta x$, that is $M$, is computed by (10):
$$M = \frac{a_2 - a_1}{b_2 - b_1} \tag{10}$$


Here $(a_1, b_1)$ and $(a_2, b_2)$ are the two coordinates that define the slope, $P_i(t)$ is the present
position of the particle at time $t$, and $P_i(t+1)$ is its updated position at $t+1$. $W_P$ is the water
absorbency, that is, the overall absorbed data. $D_b$ denotes the personal best location, and $F_b$ is the
globally best solution. The hydraulic conductivity $K_{ci}$ takes values from 0.8 to 0.95, the coefficients
$AC_1$ and $AC_2$ (the capillary constants) are set to 2.0, and the random variables $X_1$ and $X_2$ take
values from 0 to 1.
— Step 4: Defining the optimum centroid by RFFO. The RFFO method is used to define the optimum clustering
centroids, which produce a better clustering solution.
— Step 5: Termination condition. Iterate steps 2 to 3 until the maximum (predetermined) number of
iterations is reached.
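The five steps above can be sketched as a small centroid search in plain Python. This is an illustration under stated assumptions rather than the authors' code: each drop holds one candidate set of K centroids, velocities start at zero instead of being derived from the slope in (9) and (10), the memberships inside the cost use the textbook FCM form, and the names `rffo_centroids`, `fcm_cost`, and `n_drops` are hypothetical.

```python
import math
import random


def euclid(p, c):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, c)))


def fcm_cost(data, centers, q=2.0):
    """O_FCM from (1), with memberships derived from the given centers."""
    cost = 0.0
    for p in data:
        d = [max(euclid(p, c), 1e-12) for c in centers]
        for dj in d:
            w = 1.0 / sum((dj / dl) ** (2.0 / (q - 1.0)) for dl in d)
            cost += (w ** q) * dj ** 2
    return cost


def rffo_centroids(data, k, n_drops=10, it_max=50, ac1=2.0, ac2=2.0, seed=0):
    """Search for k centroids; returns (centers, best-fitness history)."""
    rng = random.Random(seed)
    dim = len(data[0])

    def fitness(pos):
        centers = [pos[j * dim:(j + 1) * dim] for j in range(k)]
        return fcm_cost(data, centers) / k            # F_fit = O_FCM / K, as in (6)

    # Step 1: random initial centroids; zero initial velocity is an assumption.
    pos = [[x for p in rng.sample(data, k) for x in p] for _ in range(n_drops)]
    vel = [[0.0] * (k * dim) for _ in range(n_drops)]
    depth_best = [list(p) for p in pos]               # D_b, one per drop
    depth_fit = [fitness(p) for p in pos]
    g = min(range(n_drops), key=depth_fit.__getitem__)
    flood_best, flood_fit = list(depth_best[g]), depth_fit[g]   # F_b
    history = [flood_fit]

    for _ in range(it_max):                           # Step 5: fixed iteration budget
        for i in range(n_drops):
            a1, a2 = rng.random(), rng.random()       # random factors in [0, 1)
            for d in range(k * dim):
                # Step 3: velocity update (8), then position update (7).
                vel[i][d] += (ac1 * a1 * (depth_best[i][d] - pos[i][d])
                              + ac2 * a2 * (flood_best[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            # Step 2: evaluate the fitness and refresh the best positions.
            f = fitness(pos[i])
            if f < depth_fit[i]:
                depth_best[i], depth_fit[i] = list(pos[i]), f
                if f < flood_fit:
                    flood_best, flood_fit = list(pos[i]), f
        history.append(flood_fit)

    # Step 4: the flood-best position gives the centroids.
    centers = [flood_best[j * dim:(j + 1) * dim] for j in range(k)]
    return centers, history
```

The returned centroids would then serve as the initial centers for the FCM iterations of Section 3.3. Because (8) is applied literally, with no damping term, velocities can grow over many iterations; tracking the flood best keeps the reported solution from getting worse, and a practical implementation might additionally clamp the velocity.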


4. RESEARCH METHOD 

The research method is built on two algorithmic concepts. First, FCM clustering is used to produce an
optimum clustering solution. Second, RFFO is used to find the optimum clustering centroids. Real-world
applications today come in various types with heterogeneous data sets. To solve these complex problems, this
paper presents a novel optimization-based clustering algorithm, the FCM based RFFO algorithm. The main
problem of medical data clustering is data preprocessing, which must handle missing data, noisy data, data
inconsistency, and redundant data in data mining. The fuzzy clustering is implemented in Python 3.8.6 on the
Windows 10 operating system with an Intel Core i5 processor. The experiment uses real medical checkup
data from 300 persons to predict the symptoms of heart disease. The experimental results are also compared
with existing methods, and performance is measured by accuracy, the Jaccard coefficient, and the Rand
coefficient.
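The paper does not state how the preprocessing is carried out, so the following is only one plausible sketch of that stage. It assumes numeric records with `None` marking a missing value, removes exact duplicates, imputes missing values with the column mean, and rescales each column to [0, 1]; noise and inconsistency handling are left out. The function name `preprocess` and all of these choices are assumptions made for illustration.

```python
def preprocess(records):
    """Clean a list of numeric records (sequences); None marks a missing value.

    Assumes every column has at least one known value.
    """
    # Redundant data: drop exact duplicates, keeping the first occurrence.
    seen, rows = set(), []
    for r in records:
        key = tuple(r)
        if key not in seen:
            seen.add(key)
            rows.append(list(r))
    for c in range(len(rows[0])):
        known = [r[c] for r in rows if r[c] is not None]
        mean = sum(known) / len(known)
        lo, hi = min(known), max(known)
        span = (hi - lo) or 1.0                      # constant columns map to 0.0
        for r in rows:
            value = mean if r[c] is None else r[c]   # missing data: mean imputation
            r[c] = (value - lo) / span               # min-max scaling to [0, 1]
    return rows
```

Scaling matters here because FCM relies on Euclidean distance, so a feature with a large numeric range, such as cholesterol, would otherwise dominate one with a small range.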


5. EXPERIMENTAL RESULTS AND DISCUSSION 

For medical data clustering, heart disease data were used in the experiments to analyze and forecast
the risk factors of heart disease. The heart disease data were collected from Johnson Jims, a staff nurse from
Kuwait, based on the reference. From the resulting clusters, suggestions can be provided for each type of
cluster. For low symptoms of heart disease, the suggestions are to eat healthy food and exercise. For average
risk factors, the suggestions cover diet, walking distance, exercises to do, and any medicine to take. For high
risk of heart disease, the suggestions concentrate more on diet, walking, regular medical checkups, and
exercise. Figure 2 gives a two-dimensional (2D) view of the clustering results for cholesterol vs age, body
mass index vs age, and glucose level vs age. Figures 2(a) to 2(c) show the 2D simulation results of FCM
based RFFO for different features of the medical data, in the early phase and the ending phase of the clusters.

Figure 2(a) shows the 2D simulation result for age vs cholesterol. Green denotes low symptoms of
heart disease risk, blue denotes an average risk factor of heart disease, and red denotes a high risk factor of
heart disease. Similarly, Figure 2(b) shows the 2D simulation result for age vs body mass index, and
Figure 2(c) shows the 2D simulation result for age vs glucose level.
Figure 2. 2D clustering result for (a) age vs cholesterol, (b) age vs body mass index, and (c) age vs glucose
level using FCM based RFFO
5.1. Comparative study analysis

Figures 3(a) to 3(c) show the comparative study analysis of the performance measures as a function
of input size. The input size varies from 50 to 300, and the analysis is based on accuracy, the Jaccard
coefficient, and the Rand coefficient. At input size 50, these measures are computed for the existing
K-means, K-harmonic means (KHM), FCM, and K-means+RFFO methods and for the proposed RFFO-based
FCM. Likewise, the accuracy, Jaccard coefficient, and Rand coefficient are calculated for input size 300.
Figure 3. The comparative study analysis for the performance measures (a) accuracy, (b) Jaccard coefficient,
and (c) Rand coefficient based on input size


Figure 3(a) shows the comparative analysis for accuracy. At input size 50, the accuracies of the
existing K-means, KHM, FCM, and RFFO+K-means models and of the proposed FCM based RFFO are
50.263%, 57.327%, 63.23%, 66.2387%, and 69.748%, respectively. At input size 300, they are 79.545%,
80.321%, 81.532%, 90.234%, and 91.234%, respectively. Figure 3(b) shows the comparative analysis for the
Jaccard coefficient. At input size 50, the values for K-means, KHM, FCM, RFFO+K-means, and the proposed
FCM based RFFO are 32.123%, 40.437%, 46.438%, 70.297%, and 72.748%, respectively; at input size 300,
they are 63.09%, 73.555%, 77.3825%, 90.626%, and 91.614%, respectively. Figure 3(c) shows the
comparative analysis for the Rand coefficient. At input size 50, the values for K-means, KHM, FCM,
RFFO+K-means, and the proposed FCM based RFFO are 47.231%, 48.487%, 53.208%, 65.767%, and
68.101%, respectively; at input size 300, they are 71.653%, 71.985%, 76.326%, 90.534%, and 91.767%,
respectively.
5.2. Performance metric

The performance of the hybrid FCM based RFFO clustering algorithm is measured by accuracy, the
Rand coefficient, and the Jaccard coefficient, which are defined in this section. Accuracy measures data
quality and is calculated from the true positives $ACT^{p}$, true negatives $ACT^{n}$, false positives
$ACF^{p}$, and false negatives $ACF^{n}$:


$$\text{Accuracy} = \frac{ACT^{p} + ACT^{n}}{ACT^{p} + ACT^{n} + ACF^{p} + ACF^{n}} \tag{11}$$
Here $ACT^{p}$, $ACT^{n}$, $ACF^{p}$, and $ACF^{n}$ are the parameters.
Jaccard coefficient measure is used to find similarities by comparing two data clusters:
$$Jacc(U, V) = \frac{|U \cap V|}{|U \cup V|} \tag{12}$$


Here U and V are two different clusters. 

The Rand coefficient is the third performance measure, which gives the ratio of correct decisions. It
estimates the share of correctly clustered pairs and is computed as

$$\text{Rand coefficient} = \frac{\text{Correct similar pairs} + \text{Correct dissimilar pairs}}{\text{Total possible pairs}} \tag{13}$$
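The three measures in (11) to (13) can be computed directly, as in the following sketch. The function names and the choice of inputs (counts for accuracy, sets of item identifiers for the Jaccard coefficient, label lists for the Rand coefficient) are assumptions made for illustration; the paper does not specify how the measures were implemented.

```python
from itertools import combinations


def accuracy(tp, tn, fp, fn):
    """Accuracy from true/false positive and negative counts, as in (11)."""
    return (tp + tn) / (tp + tn + fp + fn)


def jaccard(u, v):
    """Jaccard coefficient of two clusters given as collections of item ids, as in (12)."""
    u, v = set(u), set(v)
    return len(u & v) / len(u | v)


def rand_coefficient(labels_true, labels_pred):
    """Share of item pairs on which the two labelings agree, as in (13)."""
    agree = total = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        agree += same_true == same_pred   # correct similar or correct dissimilar pair
        total += 1
    return agree / total
```

For instance, `jaccard({1, 2, 3}, {2, 3, 4})` evaluates to 0.5, since two of the four distinct items are shared. The Rand coefficient depends only on which items are grouped together, so renaming the cluster labels leaves it unchanged.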


5.3. Comparative analysis table based on performance measure 

Table 1 analyzes the above three performance measures. The maximal accuracy, Jaccard coefficient,
and Rand coefficient of the proposed FCM based RFFO are 91.234%, 89.614%, and 92.767%, respectively.
The maximal accuracy of 91.234% is achieved by the proposed RFFO-based FCM, whereas the accuracies of
the existing K-means, KHM, FCM, and K-means based RFFO are 79.545%, 80.321%, 81.534%, and
90.166%, respectively. Likewise, the input size for the maximal Jaccard coefficient and Rand coefficient is
also given in Table 1.


Table 1. Comparative analysis 


Input        Comparative metrics       K-means   KHM      FCM      KM+RFFO   FCM+RFFO
Input size   Accuracy (%)              79.545    80.321   81.534   90.166    91.234
             Jaccard coefficient (%)   63.09     73.555   77.382   87.626    89.614
             Rand coefficient (%)      71.653    71.985   76.326   90.534    92.767


6. CONCLUSION 

This paper proposed an optimization-based clustering algorithm, the hybrid RFFO based FCM
clustering algorithm, for medical data. Heart disease medical data were used, and the model was designed for
optimal data clustering. The final data clustering was done by the FCM based RFFO algorithm. The proposed
FCM based RFFO algorithm achieved a maximal accuracy of 91.234%, a Jaccard coefficient of 89.614%, and
a Rand coefficient of 92.767%. The main advantage of this hybrid optimization-based clustering algorithm is
that it combines the strengths of both algorithms: the fast convergence of the traditional FCM clustering
algorithm and the better centroids produced by the optimization-based RFFO method. The hybrid algorithm
can therefore avoid premature convergence while producing optimum centroids. In the future, the model can
be extended with multi-objective functions for more effective and better clustering centroids. This will help
doctors make proper decisions when faced with immense needs and huge data sizes.